Abstract

Multi-output Gaussian processes (MOGP) are probability distributions over vector-valued functions, and 
have been previously used for multi-output regression and for multi-class classification. A less explored facet 
of the multi-output Gaussian process is that it can be used as a generative model for vector-valued random 
fields in the context of pattern recognition. As a generative model, the multi-output GP is able to handle 
vector-valued functions with continuous inputs, as opposed, for example, to hidden Markov models. It also 
offers the ability to model multivariate random functions with high dimensional inputs. In this report, we use a 
discriminative training criterion known as Minimum Classification Error (MCE) to fit the parameters of a multi-output
Gaussian process. We compare the performance of generative training and discriminative training of MOGP 
in emotion recognition, activity recognition, and face recognition. We also compare the proposed methodology 
against hidden Markov models trained in a generative and in a discriminative way. 


1 Introduction 


A growing interest within the Gaussian processes community in machine learning has been the formulation of
suitable covariance functions for describing multiple output processes as a joint Gaussian process. Examples
include [...] (2008), or the convolved multi-output Gaussian process Boyle and Frean (2005); Alvarez and
Lawrence (2009). Each of these methods uses as a model for the covariance function either a version of the
linear model of coregionalization (LMC) Goovaerts (1997) or a version of process convolutions (PC) Higdon (2002).
Different alternatives for building covariance functions for multiple-output processes have been reviewed by
Alvarez et al. (2012).
Multiple output GPs have been used for supervised learning problems, specifically, for multi-output regression
Bonilla et al. (2008), and multi-class classification Skolidis and Sanguinetti (2011); Chai (2012). The interest
has been mainly on exploiting the correlations between outputs to improve the prediction performance, when
compared to modeling each output independently. In particular, a Gaussian process is used as a prior over
vector-valued functions $\mathbf{f}(\mathbf{x})$ mapping from $\mathbf{x} \in \mathbb{R}^p$ to
$\mathbf{f} \in \mathbb{R}^D$. Components of $\mathbf{f}$ may be continuous or discrete.
In this report, we advocate the use of multi-output GPs as generative models for vector-valued random fields, that
is, we use multi-output GPs to directly model p(f(x)). Afterwards, we use this probabilistic model to tackle a
classification problem. An important application area where this setup is of interest is multivariate time series
classification. Here the vector-valued function f is evaluated at discrete values of x, and it is typically modeled
using an unsupervised learning method like a hidden Markov model (HMM) or a linear dynamical system (LDS)
Bishop (2007). Notice that by using a multi-output GP to model f(x), we allow the vector-valued function f(x)
to be continuous on the input space. Furthermore, we are able to model multivariate random functions for which
p > 1. It is worth mentioning that the model we propose here is different from classical GP classification as
explained, for example, in Rasmussen and Williams (2006). In standard GP classification, the feature space is not
assumed to follow a particular structure, whereas in our model, the assumption is that the feature space may be
structured, with potentially correlated and spatially varying features.

As a generative model, the multi-output Gaussian process can be used for classification: we fit a multi-output 
GP for every class independently, and to classify a new vector-valued random field, we compute its likelihood for 
each class and make a decision using Bayes rule. This generative approach works well when the real multivariate 
signal's distribution is known, but this is rarely the case. Notice that the optimization goal in the generative
model is not a function that measures classification performance, but a likelihood function that is optimized for
each class separately.

An alternative is to use discriminative training Jebara (2004) for estimating the parameters of the multi-output
GP. A discriminative approach directly optimizes a function that measures classification performance. Thus, when
the multi-output GP is not an appropriate generative distribution, the results of the discriminative training
procedure are usually better. There are different criteria to perform discriminative training, including maximum
mutual information (MMI) Gopalakrishnan et al. (1991), and minimum classification error (MCE) Juang et al. (1997).
In this report we present a discriminative approach to estimate the hyperparameters of a multi-output Gaussian
process (MOGP) based on minimum classification error (MCE). In Section 2 we review how to fit the multi-output
GP model using the generative approach, and then we introduce our method to train the same MOGP model with
a discriminative approach based on MCE. In Section 3 we show experimental results, with both the generative
and discriminative approaches. Finally, we present conclusions in Section 4.


2 Generative and discriminative training of multi-output GPs 

In our classification scenario, we have $M$ classes. We want to come up with a classifier that allows us to map the
matrix $\mathbf{F}(\mathbf{X})$ to one of the $M$ classes. Columns of matrix $\mathbf{X}$ are input vectors
$\mathbf{x}_n \in \mathbb{R}^p$, and columns of matrix $\mathbf{F}$ are feature vectors
$\mathbf{f}(\mathbf{x}_n) \in \mathbb{R}^D$, for some $n$ in an index set. Rows of $\mathbf{F}$ correspond to
different entries of $\mathbf{f}(\mathbf{x}_n)$ evaluated for all $n$. For example, in a multivariate time series
classification problem, $\mathbf{x}_n$ is a time point and $\mathbf{f}(\mathbf{x}_n)$ is the multivariate time
series at $\mathbf{x}_n = t_n$. Rows of the matrix $\mathbf{F}$ are the different time series.

The main idea that we introduce in this report is that we model the class-conditional density
$p(\mathbf{F}|\mathbf{X}, C_m, \boldsymbol{\theta}_m)$ using a multi-output Gaussian process, where $C_m$ is the
class $m$, and $\boldsymbol{\theta}_m$ are hyperparameters of the multi-output GP for class $m$. By doing so, we
allow correlations across the columns of $\mathbf{F}$, that is, between $\mathbf{f}(\mathbf{x}_n)$ and
$\mathbf{f}(\mathbf{x}_m)$, for $n \neq m$, and also allow correlations among the variables in the vector
$\mathbf{f}(\mathbf{x}_n)$, for all $n$. We then estimate $\boldsymbol{\theta}_m$ for all $m$ in a generative
classification scheme, and in a discriminative classification scheme using minimum classification error. Notice
that an HMM would model $p(\mathbf{F}|C_m, \boldsymbol{\theta}_m)$, since vectors would be already defined for
discrete values of $\mathbf{X}$. Also notice that in standard GP classification, we would model
$p(C_m|\mathbf{F})$, but with no particular correlation assumptions over the entries in $\mathbf{F}$.

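Available data for each class are matrices $\mathbf{F}_m^l$, where $m = 1, \ldots, M$, and $l = 1, \ldots, L_m$.
Index $l$ runs over the instances for a class, and each class has $L_m$ instances. In turn, each matrix
$\mathbf{F}_m^l \in \mathbb{R}^{D \times N_m^l}$ has columns $\mathbf{f}_m^l(\mathbf{x}_n) \in \mathbb{R}^D$, with
$\mathbf{x}_n \in \mathbb{R}^p$, and $n = 1, \ldots, N_m^l$. To reduce clutter in the notation, we assume that
$L_m = L$ for all $m$, and $N_m^l = N$ for all $m$ and $l$. Entries in $\mathbf{f}_m^l(\mathbf{x}_n)$ are given by
$f_d^{l,m}(\mathbf{x}_n)$ for $d = 1, \ldots, D$. We define the vector $\mathbf{f}_d^{l,m}$ with elements given by
$\{f_d^{l,m}(\mathbf{x}_n)\}_{n=1}^N$. Notice that the rows of $\mathbf{F}_m^l$ are given by $\mathbf{f}_d^{l,m}$.
Also, vector $\mathbf{f}_m^l = [\mathbf{f}_1^{l,m} \ldots \mathbf{f}_D^{l,m}]^\top$. We use $\mathbf{F}_m$ to
collectively refer to all matrices $\{\mathbf{F}_m^l\}_{l=1}^L$, or all vectors $\{\mathbf{f}_m^l\}_{l=1}^L$. We
use $\mathbf{X}_m^l$ to refer to the set of input vectors $\{\mathbf{x}_n\}_{n=1}^N$ for class $m$ and instance
$l$. $\mathbf{X}_m$ refers to all the matrices $\{\mathbf{X}_m^l\}_{l=1}^L$. Likewise, $\boldsymbol{\Theta}$ refers
to the set $\{\boldsymbol{\theta}_m\}_{m=1}^M$.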


2.1 Multiple Outputs Gaussian Processes 

According to Rasmussen and Williams (2006), a Gaussian process is a collection of random variables, any finite
number of which have a joint Gaussian distribution. We can use a Gaussian process to model a distribution
over functions. Likewise, we can use a multi-output Gaussian process to model a distribution over vector-valued
functions $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}) \ldots f_D(\mathbf{x})]^\top$. The vector-valued function
$\mathbf{f}(\mathbf{x})$ is modeled with a GP,

$$\mathbf{f}(\mathbf{x}) \sim \mathcal{GP}(\mathbf{0}, \mathbf{k}(\mathbf{x},\mathbf{x}')),$$


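where $\mathbf{k}(\mathbf{x},\mathbf{x}')$ is a kernel for vector-valued functions Alvarez et al. (2012), with
entries given by $k_{f_d,f_{d'}}(\mathbf{x},\mathbf{x}')$. A range of alternatives for
$k_{f_d,f_{d'}}(\mathbf{x},\mathbf{x}')$ can be summarized using the general expression

$$k_{f_d,f_{d'}}(\mathbf{x},\mathbf{x}') = \sum_{q=1}^{Q}\sum_{i=1}^{R_q} \int_{\mathcal{X}} G^i_{d,q}(\mathbf{x},\mathbf{z}) \int_{\mathcal{X}} G^i_{d',q}(\mathbf{x}',\mathbf{z}')\, k_q(\mathbf{z},\mathbf{z}')\, d\mathbf{z}'\, d\mathbf{z}, \qquad (1)$$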


where $Q$ is the number of latent functions used for constructing the kernel; $R_q$ is the number of latent
functions (for a particular $q$) sharing the same covariance; $G^i_{d,q}(\mathbf{x} - \mathbf{z})$ is known as the
smoothing kernel for output $d$, and $k_q(\mathbf{z},\mathbf{z}')$ is the kernel of each latent function $q$. For
details, the reader is referred to Alvarez et al. (2012). In the linear model of coregionalization,
$G^i_{d,q}(\mathbf{x},\mathbf{z}) = a^i_{d,q}\,\delta(\mathbf{x} - \mathbf{z})$, where $a^i_{d,q}$ are constants,
and $\delta(\cdot)$ is the Dirac delta function.

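In all the experiments, we use a kernel of the form (1) with $R_q = 1$. We also assume that both
$G^i_{d,q}(\mathbf{x},\mathbf{z})$ and $k_q(\mathbf{z},\mathbf{z}')$ are given by Gaussian kernels of the form

$$k(\mathbf{x},\mathbf{x}') = \frac{|\boldsymbol{\Lambda}|^{1/2}}{(2\pi)^{p/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\mathbf{x}')^\top \boldsymbol{\Lambda}\, (\mathbf{x}-\mathbf{x}')\right],$$

where $\boldsymbol{\Lambda}$ is the precision matrix.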
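
As an illustration only, the Gaussian kernel with a precision matrix can be evaluated for a batch of inputs as in the following NumPy sketch. The function name `gaussian_kernel`, the toy inputs, and the normalizing constant (square root of the determinant of the precision over $(2\pi)^{p/2}$) are assumptions made for this example; inputs are stored as rows here, whereas the report stores them as columns of X.

```python
import numpy as np

def gaussian_kernel(X1, X2, precision):
    """Normalized Gaussian kernel between the rows of X1 (n1 x p) and X2 (n2 x p).

    `precision` is the p x p precision matrix. The normalizing constant is an
    assumption of this sketch (that of a Gaussian density).
    """
    p = X1.shape[1]
    diff = X1[:, None, :] - X2[None, :, :]            # (n1, n2, p) pairwise differences
    quad = np.einsum("ijk,kl,ijl->ij", diff, precision, diff)
    norm = np.sqrt(np.linalg.det(precision)) / (2.0 * np.pi) ** (p / 2.0)
    return norm * np.exp(-0.5 * quad)

# Toy example with two-dimensional inputs (p = 2).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
precision = np.diag([1.0, 0.25])
K = gaussian_kernel(X, X, precision)
```

With a diagonal precision matrix, each input dimension gets its own inverse squared length scale, which is one simple way to handle inputs with $p > 1$.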

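Given a set of input vectors $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^N$, the columns of the matrix $\mathbf{F}$
correspond to the vector-valued function $\mathbf{f}(\mathbf{x})$ evaluated at $\mathbf{X}$. Notice that the rows
in $\mathbf{F}$ are vectors $\mathbf{f}_d = [f_d(\mathbf{x}_1) \ldots f_d(\mathbf{x}_N)]$. In the multi-output GP,
we model the vector $\mathbf{f} = [\mathbf{f}_1 \ldots \mathbf{f}_D]^\top$ by
$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$, where $\mathbf{K} \in \mathbb{R}^{ND \times ND}$, and the
entries in $\mathbf{K}$ are computed using $k_{f_d,f_{d'}}(\mathbf{x}_n, \mathbf{x}_m)$, for all
$d, d' = 1, \ldots, D$, and $n, m = 1, \ldots, N$.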
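
To make the structure of K concrete, the sketch below builds it for the linear model of coregionalization with $R_q = 1$. Substituting the Dirac delta smoothing kernels into the general kernel expression gives $k_{f_d,f_{d'}}(\mathbf{x},\mathbf{x}') = \sum_q a_{d,q} a_{d',q} k_q(\mathbf{x},\mathbf{x}')$, so with the stacking $\mathbf{f} = [\mathbf{f}_1 \ldots \mathbf{f}_D]^\top$ the full matrix is a sum of Kronecker products. This is not the process-convolution kernel used in the experiments; the helper names, the unnormalized squared-exponential latent kernel, and the jitter value are assumptions of this example.

```python
import numpy as np

def rbf(X1, X2, lengthscale):
    """Unnormalized squared-exponential kernel between rows of X1 and X2."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def lmc_covariance(X, A, lengthscales):
    """K = sum_q (a_q a_q^T) kron K_q for the stacking f = [f_1 ... f_D]^T.

    X: (N, p) inputs, A: (D, Q) coefficients a_{d,q}, lengthscales: Q values.
    """
    N = X.shape[0]
    D, Q = A.shape
    K = np.zeros((D * N, D * N))
    for q in range(Q):
        B_q = np.outer(A[:, q], A[:, q])              # (D, D) coregionalization matrix
        K += np.kron(B_q, rbf(X, X, lengthscales[q]))
    return K

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 5)[:, None]                 # N = 5 one-dimensional inputs
A = rng.normal(size=(3, 2))                           # D = 3 outputs, Q = 2 latent functions
K = lmc_covariance(X, A, lengthscales=[0.2, 0.5])

# Draw one sample f ~ N(0, K); a small jitter keeps the Cholesky factor stable.
chol = np.linalg.cholesky(K + 1e-6 * np.eye(K.shape[0]))
f = chol @ rng.normal(size=K.shape[0])
F = f.reshape(3, 5)                                   # rows of F are the D outputs
```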


2.2 Generative Training 

In the generative model, we train separately a multi-output GP for each class. In our case, training consists of
estimating the kernel hyperparameters of the multi-output GP, $\boldsymbol{\theta}_m$. Let us assume that the
training set consists of several multi-output processes grouped in $\mathbf{F}_m$ and drawn independently from the
Gaussian process generative model given by

$$p(\mathbf{f}_m^l \mid \mathbf{X}_m^l, C_m, \boldsymbol{\theta}_m) = \mathcal{N}(\mathbf{f}_m^l \mid \mathbf{0}, \mathbf{K}_m),$$

where $\mathbf{K}_m$ is the kernel matrix for class $m$, as explained in Section 2.1.

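In order to train the generative model, we maximize the log marginal likelihood function with respect to the
parameter vector $\boldsymbol{\theta}_m$. As we assumed that the different instances of the multi-output process
are generated independently given the kernel hyperparameters, we can write the log marginal likelihood for class
$m$, $\log p(\mathbf{F}_m|\boldsymbol{\theta}_m)$, as

$$\log p(\mathbf{F}_m|\boldsymbol{\theta}_m) = \sum_{l=1}^{L}\left[-\frac{1}{2}(\mathbf{f}_m^l)^\top \mathbf{K}_m^{-1}\mathbf{f}_m^l - \frac{1}{2}\log|\mathbf{K}_m|\right] - \frac{NDL}{2}\log(2\pi). \qquad (2)$$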

We use a gradient-descent procedure to perform the optimization. 
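
A minimal sketch of how the log marginal likelihood of one class could be evaluated is given below, assuming the L stacked instance vectors are stored as the rows of an array and that a single kernel matrix is shared by all instances of the class. It uses a Cholesky factorization instead of an explicit inverse, which is a common numerical choice and not something the report prescribes. The function name and the toy data are hypothetical.

```python
import numpy as np

def log_marginal_likelihood(F_stack, K):
    """Sum over instances of log N(f | 0, K).

    F_stack: (L, N*D) array, one stacked vector f per row.
    K: (N*D, N*D) kernel matrix shared by all instances of the class.
    """
    L_inst, n = F_stack.shape
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol, F_stack.T)          # chol^{-1} f for every instance
    quad = np.sum(alpha ** 2)                         # sum_l f_l^T K^{-1} f_l
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))      # log |K|
    return -0.5 * quad - 0.5 * L_inst * logdet - 0.5 * L_inst * n * np.log(2.0 * np.pi)

# Toy check against the direct formula.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
K_toy = A @ A.T + np.eye(6)
F_toy = rng.normal(size=(3, 6))
value = log_marginal_likelihood(F_toy, K_toy)
```

In practice the negative of this quantity would be passed to a gradient-based optimizer over the kernel hyperparameters.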

To predict the class label for a new matrix $\mathbf{F}_*$ or, equivalently, a new vector $\mathbf{f}_*$, and
assuming equal prior probabilities for each class, we compute the marginal likelihood
$p(\mathbf{F}_*|\mathbf{X}_*, C_m, \boldsymbol{\theta}_m)$ for all $m$. We predict the class for which the marginal
likelihood is largest.
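
The decision rule can be sketched as follows, assuming that the kernel matrix of every class has already been evaluated at the inputs of the new instance. The two toy covariances are hypothetical and only serve to show the argmax over classes.

```python
import numpy as np

def gaussian_log_density(f, K):
    """log N(f | 0, K) computed through a Cholesky factorization."""
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol, f)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(chol)))
            - 0.5 * f.size * np.log(2.0 * np.pi))

def predict_class(f_new, class_kernels):
    """Return the index of the class with the largest likelihood (equal priors)."""
    scores = np.array([gaussian_log_density(f_new, K) for K in class_kernels])
    return int(np.argmax(scores)), scores

# Two toy classes: low-variance and high-variance signals.
kernels = [0.01 * np.eye(4), 100.0 * np.eye(4)]
label_small, _ = predict_class(np.full(4, 0.05), kernels)
label_large, _ = predict_class(np.full(4, 10.0), kernels)
```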


2.3 Discriminative Training 

In discriminative training, we search for the hyperparameters that minimize some classification error measure
for all classes simultaneously. In this report, we chose the minimum classification error criterion as
presented in Juang et al. (1997). A soft version of the $\{0,1\}$ loss function for classification can be written as

$$\ell_m(\mathbf{f}) = \frac{1}{1 + \exp(-\gamma_1 d_m(\mathbf{f}) + \gamma_2)}, \qquad (3)$$

where $\gamma_1 > 0$ and $\gamma_2$ are user given parameters, and $d_m(\mathbf{f})$ is the class misclassification
measure, given by

$$d_m(\mathbf{f}) = -g_m(\mathbf{f}) + \log\left[\frac{1}{M-1}\sum_{\forall k,\, k \neq m} e^{g_k(\mathbf{f})\eta}\right]^{1/\eta}, \qquad (4)$$


where $\eta > 0$, and $g_m(\mathbf{f}) = a \log p(\mathbf{f}|\mathbf{X}, C_m, \boldsymbol{\theta}_m) + b
= a \log \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K}_m) + b$. Parameters $a > 0$ and $b$ are again defined by
the user. Expression $g_m(\mathbf{f})$ is a scaled and translated version of the log marginal likelihood for the
multi-output GP of class $m$. We scale the log marginal likelihood to keep the value of $g_m(\mathbf{f})$ in a
small numerical range such that computing $\exp(g_k(\mathbf{f})\eta)$ does not overflow the range of a
double-precision floating point number. Parameters $\gamma_1$ and $\gamma_2$ in equation (3) have the same role as
$a$ and $b$ in $g_m(\mathbf{f})$, but the numerical problems are less severe here and setting $\gamma_1 = 1$ and
$\gamma_2 = 0$ usually works well.

The expression in equation (4) converges to $-g_m(\mathbf{f}) + \max_{\forall k:\, k \neq m} g_k(\mathbf{f})$ as
$\eta$ tends to infinity. For finite values of $\eta$, $d_m(\mathbf{f})$ is a differentiable function. The value
of $d_m(\mathbf{f})$ is negative if $g_m(\mathbf{f})$ is greater than the "maximum" of $g_k(\mathbf{f})$, for
$k \neq m$. We expect this to be the case if $\mathbf{f}$ truly belongs to class $C_m$. Therefore, expression
$d_m(\mathbf{f})$ plays the role of a discriminant function between $g_m(\mathbf{f})$ and the "maximum" of
$g_k(\mathbf{f})$, with $k \neq m$.^1 The misclassification measure is a continuous function of
$\boldsymbol{\Theta}$, and attempts to model a decision rule. Notice that if $d_m(\mathbf{f}) < 0$ then
$\ell_m(\mathbf{f})$ goes to zero, and if $d_m(\mathbf{f}) > 0$ then $\ell_m(\mathbf{f})$ goes to one, which is why
expression (3) can be seen as a soft version of a $\{0,1\}$ loss function. The loss function takes into account the
class-conditional densities $p(\mathbf{f}|\mathbf{X}, C_m, \boldsymbol{\theta}_m)$, for all classes, and thus,
optimizing $\ell_m(\mathbf{f})$ implies the optimization over the set $\boldsymbol{\Theta}$.
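
The misclassification measure and the soft loss can be sketched numerically as below, assuming the scaled log-likelihoods $g_k(\mathbf{f})$ of all classes are already available in an array `g`. The log-mean-exp trick used to avoid overflow is an implementation choice of this sketch, not something stated in the report, and the numerical values are made up for illustration.

```python
import numpy as np

def misclassification_measure(g, m, eta):
    """d_m = -g_m + (1/eta) * log( mean over k != m of exp(eta * g_k) )."""
    others = np.delete(g, m)
    z = eta * others
    z_max = np.max(z)                                 # shift to keep exp() in range
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))
    return -g[m] + log_mean_exp / eta

def soft_loss(d, gamma1=1.0, gamma2=0.0):
    """Sigmoid approximation of the {0, 1} loss."""
    return 1.0 / (1.0 + np.exp(-gamma1 * d + gamma2))

# Hypothetical scaled log-likelihoods for M = 3 classes; class 0 fits best.
g = np.array([-1.0, -3.0, -2.5])
d_correct = misclassification_measure(g, m=0, eta=50.0)   # f labelled with class 0
d_wrong = misclassification_measure(g, m=1, eta=50.0)     # f labelled with class 1
loss_correct = soft_loss(d_correct)
loss_wrong = soft_loss(d_wrong)
d_limit = misclassification_measure(g, m=0, eta=1e4)      # close to -g_0 + max(g_1, g_2)
```

When the labelled class has the largest score the measure is negative and the loss falls below one half; otherwise the loss approaches one.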

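Given some dataset $\{\mathbf{X}_m, \mathbf{F}_m\}_{m=1}^M$, the purpose is then to find the hyperparameters
$\boldsymbol{\Theta}$ that minimize the cost function that counts the number of misclassification errors in the
dataset,

$$\mathcal{L}(\{\mathbf{X}_m\}_{m=1}^M, \{\mathbf{F}_m\}_{m=1}^M, \boldsymbol{\Theta}) = \sum_{m=1}^{M}\sum_{l=1}^{L} \ell_m(\mathbf{f}_m^l). \qquad (5)$$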

We can compute the derivatives of equation (5) with respect to the hyperparameters $\boldsymbol{\Theta}$, and then
use a gradient optimization method to find the optimal hyperparameters for the minimum classification error
criterion.
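
As a sketch of how these derivatives might be organized (this derivation is not given in the report and simply applies the chain rule to the sigmoid loss and the misclassification measure defined above), let $\theta$ be any single entry of $\boldsymbol{\Theta}$. Then

$$\frac{\partial \ell_m}{\partial \theta} = \gamma_1\, \ell_m (1 - \ell_m)\, \frac{\partial d_m}{\partial \theta}, \qquad \frac{\partial d_m}{\partial \theta} = -\frac{\partial g_m}{\partial \theta} + \sum_{k \neq m} w_k \frac{\partial g_k}{\partial \theta}, \qquad w_k = \frac{e^{\eta g_k}}{\sum_{j \neq m} e^{\eta g_j}},$$

where $\partial g_k / \partial \theta$ is zero unless $\theta$ belongs to $\boldsymbol{\theta}_k$, in which case the standard expression for the gradient of a Gaussian log-likelihood applies:

$$\frac{\partial g_k}{\partial \theta} = a \left[ \frac{1}{2}\, \mathbf{f}^\top \mathbf{K}_k^{-1} \frac{\partial \mathbf{K}_k}{\partial \theta} \mathbf{K}_k^{-1} \mathbf{f} - \frac{1}{2} \operatorname{tr}\!\left( \mathbf{K}_k^{-1} \frac{\partial \mathbf{K}_k}{\partial \theta} \right) \right].$$

The weights $w_k$ form a softmax over the competing classes, so for large $\eta$ the gradient is dominated by the strongest competitor.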


2.3.1 Computational complexity 

Equation (4) requires us to compute a sum over all possible classes. To compute each term $g_k(\mathbf{f})$ in
that sum, we need to invert a matrix of dimension $DN \times DN$. The computational complexity of each
optimization step is then $O(LMD^3N^3)$, which could be very slow for many applications.

In order to reduce computational complexity, in this report we resort to low rank approximations for the
covariance matrix appearing in the likelihood model. In particular, we use the partially independent training
conditional (PITC) approximation, and the fully independent training conditional (FITC) approximation, both
approximations for multi-output GPs Alvarez and Lawrence (2011).

These approximations reduce the complexity to $O(LMK^2DN)$, where $K$ is a parameter specified by the user.
The value of $K$ refers to the number of auxiliary input variables used for performing the low rank approximations.
The locations of these input variables can be optimized within the same optimization procedure used for finding
the hyperparameters $\boldsymbol{\Theta}$. For details, the reader is referred to Alvarez and Lawrence (2011).
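
To give a rough sense of scale, the snippet below compares the two operation counts using the sizes of the emotion recognition experiment (M = 4, L = 40, D = 38, N = 71, K = 25). It assumes the low rank cost scales as $K^2 DN$ per instance and class, and it ignores constant factors, so the ratio is only indicative.

```python
def full_cost(L, M, D, N):
    """Operation count proportional to L * M * (D * N)^3."""
    return L * M * (D * N) ** 3

def low_rank_cost(L, M, D, N, K):
    """Operation count proportional to L * M * K^2 * D * N (assumed scaling)."""
    return L * M * K ** 2 * D * N

L, M, D, N, K = 40, 4, 38, 71, 25
speedup = full_cost(L, M, D, N) / low_rank_cost(L, M, D, N, K)
print(round(speedup))  # prints 11647
```

The ratio equals $(DN)^2 / K^2$ and therefore does not depend on the number of classes or instances.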


3 Experimental Results 

In the following sections, we show results for different experiments that compare the following methods: hidden
Markov models trained in a generative way using the Baum-Welch algorithm Rabiner (1989), hidden Markov
models trained in a discriminative way using minimum classification error Juang et al. (1997), multi-output
GPs trained in a generative way using maximum likelihood (this report), and multi-output GPs trained in a
discriminative way using minimum classification error (this report). In Section 3.1 we test the different methods
for emotion classification from video sequences on the Cohn-Kanade Database Lucey et al. (2010). In Section
3.2 we compare the methods for activity recognition (running and walking) from video sequences on the CMU
MOCAP Database. In Section 3.3 we use the CMU MOCAP database again to identify subjects from their
walking styles. For this experiment, we also try different frame rates for the training and validation cameras
to show how the multi-output GP method adapts to this case. Finally, in Section 3.4 we show an example of face
recognition from images. Our intention here is to show our method on an example in which the dimensionality
of the input space, p, is greater than one.

For all the experiments, we assume that the HMMs have a Gaussian distribution per state. The number of hidden
states of an HMM is shown in parentheses in each of the experiments; for instance, HMM(q) means an HMM with
q hidden states.


^1 We use quotes for the word maximum, since the true maximum is only achieved when $\eta$ tends to infinity.

| Method  | Generative    | MCE          |
|---------|---------------|--------------|
| FITC    | 79.58 ± 4.52  | 89.55 ± 2.53 |
| PITC    | 69.16 ± 16.82 | 87.07 ± 5.77 |
| FITC*   | 79.16 ± 3.29  | 89.16 ± 6.49 |
| PITC*   | 70.83 ± 16.21 | 85.82 ± 3.73 |
| HMM (3) | 85.80 ± 7.26  | 84.15 ± 9.60 |
| HMM (5) | 79.00 ± 3.74  | 87.91 ± 4.27 |
| HMM (7) | 70.80 ± 8.87  | 91.66 ± 6.08 |

Table 1: Classification accuracy (mean and standard deviation) for emotion recognition on the Cohn-Kanade database
using dynamic time warping on the features. The asterisk in the table means that we also optimized the locations of the
auxiliary variables for the low rank approximation.


3.1 Emotion Recognition from Sequence Images 

For the first experiment, we used the Cohn-Kanade Database Lucey et al. (2010). This database consists of
processed videos of people expressing emotions, starting from a neutral face and going to the final emotion. The
features are the positions (x, y) of some key points or landmarks in the face of the person expressing the emotion.
The database consists of seven emotions. We used four emotions, those having more than 40 realizations, namely,
anger, disgust, happiness, and surprise. That is, M = 4. Each instance consists of 68 landmarks evolving over
time. Figure 1 shows a description of the Cohn-Kanade facial expression database. We employed 19 of those 68
key points (see Figure 1b), associated with the lips and the eyes, among others, which are thought to be more
relevant for emotion recognition, according to Valstar and Pantic (2012). Figure 1c shows these relevant features.



Figure 1: Cohn-Kanade emotion recognition example. a) Sample faces from CK database. b) Facial landmarks provided
by CK database Lucey et al. (2010). c) Key shape points according to Valstar and Pantic (2012).


In this experiment, we model the coordinate x and the coordinate y of each landmark as a time series, that is,
x(t) and y(t). With 19 landmarks, and two time series per landmark, we are modeling multivariate time series
of dimension D = 38. The length of each time series N was fixed in this first experiment to 71, using a dynamic
time warping algorithm for multiple signals Zhou and De la Torre Frade (2012). Our matrices are therefore
$\mathbf{F}_m^l \in \mathbb{R}^{38 \times 71}$. For each class, we have L = 40 instances, and use 70% for the
training set and 30% for the validation set. We repeated the experiments five times. Each time, we had different
instances in the training set and the validation set. We trained multi-output GPs with FITC and PITC
approximations, both fixing and optimizing the auxiliary variables for the low rank approximations. The number of
auxiliary variables was K = 25. When not optimized, the auxiliary variables were uniformly placed along the input
space.

Accuracy results are shown in Table 1. The table provides the mean and the standard deviation for the results of
the five repetitions of the experiment. The star symbol (*) on the method name means that the auxiliary input
points were optimized; otherwise the auxiliary points were fixed.

Gen and Disc refer to the generative and discriminative training, using either the multi-output GP or the HMM.
The table shows that for multi-output GPs, discriminative training leads to better results than the generative
training. The table also shows results for HMM with 3, 5 and 7 hidden states respectively. Results for the HMM
with generative training, and the multi-output GP with generative training are within the same range, if we take
into account the standard deviation. Accuracies are also similar when comparing the HMM trained with MCE
and the multi-output GP trained with MCE. We thus show experimentally that the multi-output GP is as good
as the HMM for emotion recognition.

Figure 2: Sample actions from MOCAP database, run action (above) and walk action (below).

| Method  | Generative    | Discriminative |
|---------|---------------|----------------|
| FITC    | 60.68 ± 3.98  | 95.71 ± 2.98   |
| PITC    | 76.40 ± 12.38 | 93.56 ± 5.29   |
| FITC*   | 58.90 ± 0.00  | 96.78 ± 1.95   |
| PITC*   | 69.28 ± 15.28 | 84.90 ± 11.33  |
| HMM (3) | 96.70 ± 2.77  | 97.95 ± 2.23   |
| HMM (5) | 94.69 ± 4.36  | 96.32 ± 0.82   |
| HMM (7) | 92.24 ± 4.49  | 99.77 ± 0.99   |

Table 2: Classification accuracy rates (mean and standard deviation) for activity identification on the CMU-MOCAP
database.


3.2 Activity Recognition With Motion Capture Data Set


For the second experiment, we use a motion capture database (MOCAP) to classify between walking and running
actions. In MOCAP, the input consists of the different angles between the bones of a 3D skeleton. The camera
used for the MOCAP database has a frame rate of 120 frames per second, but in this experiment we sub-sampled
the frames to [...] of the original frame rate. Our motion capture data set is from the CMU motion capture
database.^2 We considered two different categories of movement: running and walking. For running, we take subject 2
motion 3, subject 9 motions 1-11, subject 16 motions 35, 36, 45, 46, 55, 56, subject 35 motions 17-26, subject
127 motions 3, 6, 7, 8, subject 141 motions 1, 2, 3, 34, subject 143 motions 1, 42, and for walking we take subject
7 motions 1-11, subject 8 motions 1-10, subject 35 motions 1-11, subject 39 motions 1-10. Figure 2 shows an
example of activity recognition in the MOCAP database. In this example, then, we have two classes, M = 2, and
D = 62 time courses of angles, modeled as a multivariate time series. We also have $L_1 = 38$ for running, and
$L_2 = 42$ for walking.

Here again we compare the generative and discriminative approaches on both our proposed model with FITC and
PITC approximations and HMMs. Again, we assume K = 25. One important difference between the experiment
of Section 3.1 and this experiment is that we are using the raw features here, whereas in the experiment of
Section 3.1 we first performed dynamic time warping so that all the time series have the same length. This means
that for this experiment, $N_m^l$ actually depends on the particular $m$ and $l$.

The results are shown in Table 2 for five repeats of the experiment. For each repeat, we used 15 instances for
training, and the rest in each class for validation. Again the results are comparable with the results of the HMM
within the standard deviation. As before, the discriminative approach shows in general better results than the
generative approach.

Experiments in this section and Section 3.1 show that multi-output GPs exhibit similar performance to HMMs
when used for pattern recognition of different types of multivariate time series.

^2 The CMU Graphics Lab Motion Capture Database was created with funding from NSF EIA-0196217 and is available at
http://mocap.cs.cmu.edu.

Figure 3: Sample walk actions from CMU MOCAP database, subjects 7, 8 and 35.

3.3 Subject Identification on a Motion Capture Data Set 

For the third experiment we took again the CMU MOCAP database but instead of classifying between different
actions, we recognized subjects by their walking styles. We considered three different subjects exhibiting
walk movements. To perform the identification we took subject 7 motions 1, 2, 3, 6, 7, 8, 9, 10, subject 8 motions
1, 2, 3, 5, 6, 8, 9, 10, and subject 35 motions 1-8. Then for each subject we took four instances for training and
the other four for validation. Figure 3 shows an example of subject identification in the CMU-MOCAP
database. We then have M = 3, D = 62, L = 8, and the length of each instance was variable.

For this experiment, we considered the scenario where the frame rate for the motions used in training could be
different from the frame rate for the motions used in testing. This configuration simulates the scenario where
cameras with different recording rates are used to keep track of human activities. Notice that HMMs are not
expected to adapt well to this scenario, since the Markov assumption is that the current state depends only on
the previous state. The Gaussian process, however, captures dependencies of any order, and encodes those
dependencies in the kernel function, which is a continuous function of the input variables. Thus, we can evaluate
the GP for any set of input points at the testing stage, without the need to train the whole model again.

Table 3 shows the results of this experiment. In the table, we study three different scenarios: one for which the
frame rate in the training instances was slower than in the validation instances, one for which it was faster, and
one for which it was the same. We manipulate the frame rates by decimating in training (DT), and decimating
in validation (DV). For example, a decimation of 1/16 means that one of every 16 frames of the original time
series is taken. When the validation frame rate is faster than the training frame rate (column Faster in Table
3), the performance of the multi-output GP is clearly superior to the one exhibited by the HMM, both for the
generative and the discriminative approaches. When the validation frame rate is slower than or equal to the
training frame rate (columns Slower and Equal), the performances are similar (within the standard
deviation) for multi-output GPs and HMMs, if they are trained with MCE. If the models are trained generatively,
the multi-output GP outperforms the HMM. Although the results for the HMM in Table 3 were obtained fixing
the number of states to seven, we also performed experiments for three and five states, obtaining similar results.
This experiment shows an example where our model is clearly useful to solve a problem that an HMM does not
solve satisfactorily.
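
The decimation step can be illustrated with the short sketch below. The key point is that the time stamps are kept alongside the decimated frames, so that a kernel defined on continuous inputs can be evaluated at whatever inputs the validation camera provides. The decimation factors (16 for training, 4 for validation), the sequence length, and the toy signals are hypothetical values chosen for this example.

```python
import numpy as np

def decimate(F, t, factor):
    """Keep one of every `factor` frames.

    F: (D, N) multivariate time series, t: (N,) time stamps in seconds.
    """
    return F[:, ::factor], t[::factor]

frame_rate = 120.0                                    # frames per second of the original camera
t = np.arange(64) / frame_rate
F = np.vstack([np.sin(2.0 * np.pi * t), np.cos(2.0 * np.pi * t)])

F_train, t_train = decimate(F, t, 16)                 # slower training camera
F_val, t_val = decimate(F, t, 4)                      # validation camera 4 times faster
```

Both decimated sequences index the same underlying continuous-time signal, only on different input grids.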

| Method      | Faster       | Slower        | Equal         |
|-------------|--------------|---------------|---------------|
| FITC Gen    | 93.28 ± 3.76 | 94.96 ± 4.60  | 94.96 ± 4.60  |
| PITC Gen    | 93.28 ± 3.76 | 94.96 ± 4.60  | 94.96 ± 4.60  |
| FITC MCE    | 94.96 ± 4.60 | 89.96 ± 9.12  | 89.96 ± 9.12  |
| PITC MCE    | 94.96 ± 4.60 | 93.28 ± 3.76  | 88.32 ± 12.63 |
| HMM Gen (7) | 33.33 ± 0.00 | 36.40 ± 12.56 | 81.60 ± 6.98  |
| HMM MCE (7) | 83.33 ± 16.6 | 94.90 ± 4.60  | 100.00 ± 0.00 |

Table 3: Classification accuracy rates (mean and standard deviation) of subject identification by walking style on the
CMU-MOCAP database. The first column (Faster) shows the results when the validation camera is 4 times faster than
the training camera (DT = [...], DV = [...]). The second column (Slower) shows the results when the validation camera
is 4 times slower than the training camera (DT = [...], DV = [...]). The last column (Equal) shows the results when both
the validation and training cameras have the same frequency (DT = [...], DV = [...]).


3.4 Face Recognition 

The goal of the fourth experiment is to show an example where the vector-valued function depends on input
variables with dimensionality greater than one, that is, functions of multi-dimensional inputs
($f(\mathbf{x}_0), f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$), like space. The HMMs as used here are not easily
generalized to this case and, thus, we do not present results with HMMs for this experiment. In this problem we
work with face recognition from pictures of the Georgia Tech database.^3 This database contains images of 50
subjects stored in JPEG format with 640 x 480 pixel resolution. For each individual, 15 color images were taken,
considering variations in illumination, facial expression, face orientation and appearance (presence of
glasses). Figure 4 shows an example from the Georgia Tech Face database.



Figure 4: Sample faces from Georgia Tech Face database. 


Here we did two experiments. The first experiment was carried out taking 5 subjects of the Georgia Tech database 
that did not have glasses. For the second experiment we took another 5 subjects of the database that had glasses. 
In both experiments, each image was divided into a given number of regions of equal aspect ratio. For each region
$n$ we computed its centroid $\mathbf{x}_n$ and a texture vector $\mathbf{f}_n$. Notice that this can be directly
modeled by a multi-output GP where the input vectors $\mathbf{x}_n$ are two dimensional.

Tables 4, 5, 6 and 7 show the results of this experiment with the discriminative and the generative training
approaches. The number of divisions in the X and Y coordinates are BX and BY respectively. The features
extracted from each block are mean RGB values and Segmentation-based Fractal Texture Analysis (SFTA) Costa
et al. (2012) of each block. The SFTA algorithm extracts a feature vector from each region by decomposing it
into a set of binary images, and then computing a scalar measure based on fractal symmetry for each of those
binary images.
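
The block decomposition can be sketched as follows for the mean RGB part of the features only; SFTA is not implemented here. The function name, the use of a random array in place of a real 640 x 480 image, and the rounding of block edges to integer pixels are assumptions of this example.

```python
import numpy as np

def block_features(image, bx, by):
    """Split an (H, W, 3) image into bx * by blocks.

    Returns the block centroids (bx * by, 2) in pixel coordinates (x, y)
    and the mean RGB value of each block (bx * by, 3).
    """
    H, W, _ = image.shape
    x_edges = np.linspace(0, W, bx + 1).astype(int)
    y_edges = np.linspace(0, H, by + 1).astype(int)
    centroids, features = [], []
    for j in range(by):
        for i in range(bx):
            block = image[y_edges[j]:y_edges[j + 1], x_edges[i]:x_edges[i + 1]]
            centroids.append([(x_edges[i] + x_edges[i + 1]) / 2.0,
                              (y_edges[j] + y_edges[j + 1]) / 2.0])
            features.append(block.reshape(-1, 3).mean(axis=0))
    return np.array(centroids), np.array(features)

rng = np.random.default_rng(0)
image = rng.uniform(size=(480, 640, 3))               # stand-in for a 640 x 480 color image
X, F = block_features(image, bx=4, by=7)              # grid with BX = 4, BY = 7
```

Each row of `X` is a two-dimensional input and the matching row of `F` is the corresponding feature vector, which is the layout a multi-output GP with $p = 2$ would consume.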

The results show high accuracy in the recognition process in both schemes (faces with glasses and faces without
glasses) when using discriminative training. For all the settings, the results of the discriminative training method
are better than the results of the generative training method. This experiment shows the versatility of the
multi-output Gaussian process to work in applications that go beyond time series classification.

^3 Georgia Tech Face Database, http://www.anefian.com/research/face_reco.htm

| Method | Gen          | Disc         |
|--------|--------------|--------------|
| FITC   | 61.57 ± 3.50 | 86.84 ± 0.01 |
| PITC   | 64.72 ± 2.34 | 95.78 ± 8.03 |
| FITC*  | 66.71 ± 3.82 | 96.84 ± 7.06 |
| PITC*  | 73.68 ± 5.88 | 96.30 ± 3.00 |

Table 4: Recognition accuracy (mean and standard deviation) for faces without glasses using a grid of size BX=4, BY=7.

| Method | Gen          | Disc         |
|--------|--------------|--------------|
| FITC   | 51.57 ± 3.5  | 88.42 ± 2.35 |
| PITC   | 69.47 ± 3.53 | 83.68 ± 4.30 |
| FITC*  | 56.80 ± 2.44 | 86.84 ± 0.01 |
| PITC*  | 62.10 ± 8.24 | 87.36 ± 1.17 |

Table 5: Recognition accuracy (mean and standard deviation) for faces without glasses using a grid of size BX=6, BY=7.

| Method | Gen          | Disc         |
|--------|--------------|--------------|
| FITC   | 54.73 ± 6.55 | 81.57 ± 3.7  |
| PITC   | 64.21 ± 9.41 | 81.57 ± 7.2  |
| FITC*  | 60.53 ± 0.02 | 90.52 ± 9.41 |
| PITC*  | 69.47 ± 9.41 | 77.36 ± 8.24 |

Table 6: Recognition accuracy (mean and standard deviation) for faces with glasses using a grid of BX=4, BY=7.

| Method | Gen          | Disc         |
|--------|--------------|--------------|
| FITC   | 42.1 ± 0.02  | 93.68 ± 2.35 |
| PITC   | 35.78 ± 2.35 | 86.84 ± 0.01 |
| FITC*  | 72.6 ± 5.45  | 86.84 ± 0.01 |
| PITC*  | 48.42 ± 2.35 | 89.47 ± 0.01 |

Table 7: Recognition accuracy (mean and standard deviation) for faces with glasses using a grid of BX=6, BY=7.


4 Conclusions 


In this report, we advocated the use of multi-output GPs as generative models for vector-valued random fields. We
showed how to estimate the hyperparameters of the multi-output GP in a generative way and in a discriminative
way, and through different experiments we showed that the performance of our framework is equal to or better
than that of its natural competitor, the HMM.

For future work, we would like to study the performance of the framework using alternative discriminative criteria,
like Maximum Mutual Information (MMI) using gradient optimization or Conditional Expectation Maximization
Jebara (2004). We would also like to try practical applications for which there is the need to classify vector-valued
functions with higher dimensionality input spaces. Computational complexity is still an issue; we would like to
implement alternative efficient methods for training the multi-output GPs Hensman et al. (2013).